Resource community topic modeling with spreading activation

ABSTRACT

The present invention relates to computer implemented methods and systems for determining relevance measures for computational resources based on their relatedness to a user's interests. The methods and systems are designed to accept as inputs a collection of unstructured textual data related to resources, and a structured graph of the relationships between resources, to calculate probability distributions of resources over latent communities discovered from the unstructured textual data, to activate the structured graph with these probability distributions, and to spread this activation throughout the graph in a fixed number of iterations. The result of these methods and of the systems implementing these methods is a set of relevance measures attached to the resources in the structured graph.

FIELD OF THE INVENTION

The present invention generally relates to the search for computational resources within an interconnected network. More particularly, the present invention relates to computer implemented methods and systems to attach a relevance measure to computational resources based on their relatedness to a user's interests, using the statistical machine learning technique known as topic modeling.

CROSS-REFERENCE TO RELATED APPLICATIONS

Not applicable.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT

Not applicable.

BACKGROUND OF THE INVENTION

In the context of this application, a resource is defined as any entity represented through some textual description within a computational environment. Since any entity can be so described, the universe of possible resources comprises the universe of all possible entities, including such computational resources as databases and other data sources; queries against databases; publications, webpages, and other textual entities; and people.

Entity ranking refers to the assignment of a relevance value to related objects and entities from different sources. For the search of experts in particular, multiple techniques have been used for this purpose, including probabilistic models and graph-based approaches. Probabilistic models measure associations between experts by detecting their probability distributions with respect to resources such as documents. Graph-based models utilize predefined interconnections between entities to uncover associations.

Topic modeling is a probabilistic generative process designed to uncover the semantics of a collection of documents using a hierarchical Bayesian analysis. The objective of topic modeling is to estimate a probabilistic model of a corpus of documents that assigns high probability to the members of the corpus and also to other “similar” documents. The initial development of topic models conceptualized topics as probabilistic distributions over the words in independent documents. Enhancements and modifications to the basic topic model algorithm that have been proposed include the incorporation of authorship information and the use of multi-level topic arrangements, where topics at one level are considered to be distributions of topics at a lower level. None of the currently proposed techniques, however, combine the ability to model distributions of topics, which we call communities, with the use of authorship information in order to generate authors as distributions over communities. Moreover, the models using authorship information use the concept of “authorship” literally, requiring an author over a piece of text, and do not allow for the use of other structural relationships between resources, such as the textual description of a data source.

Spreading activation is a theory first proposed to model the retrieval characteristics of human memory; it postulates that cognitive units form an interconnected network, and that retrieval is achieved through the spread of activation throughout this network. In recent years, this theory has been successfully applied as a method for associative retrieval in graph-based computer applications.

Most entity ranking approaches concentrate either on the use of probabilistic models over unstructured textual contents, typically using the relationship between experts and their publications, or on the use of graph-theoretic approaches over some predetermined relationships between entities. It seems clear that to achieve better accuracy on relevance rankings with respect to user expectations, it is necessary to combine both the unstructured and structured information within a single framework, and to enable the modeling of communities of resources. Accordingly, it is desirable to derive systems and methods that fulfill these characteristics and that overcome existing deficiencies in the state of the art.

SUMMARY OF THE INVENTION

In accordance with the present invention, computer implemented methods and systems are provided for determining relevance measures for computational resources based on their relatedness to a user's interests.

In accordance with some embodiments of the present invention, in response to receiving a structured graph of interconnections between resources and a set of unstructured textual data attached to these resources, calculations are performed to define the relatedness of each of the resources in the graph to the user performing the search. In some embodiments, an additional input consisting of keywords is also provided, to guide the search results. The calculations consist in the discovery of latent topics as probability distributions over words, of latent communities as probability distributions over topics, of the probability distribution of resources over communities, and of the relevance ranking based on these distributions. In some embodiments, these distributions are subsequently processed by spreading activation over the structural graph of resources, deriving a final relevance ranking.

There has thus been outlined, rather broadly, the more important features of the invention in order that the detailed description thereof that follows may be better understood, and in order that the present contribution to the art may be better appreciated. There are, of course, additional features of the invention that will be described hereinafter and which will form the subject matter of the claims appended hereto.

In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the purposes of the present invention. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the present invention.

These together with other objects of the invention, along with the various features of novelty which characterize the invention, are pointed out with particularity in the claims annexed to and forming a part of this disclosure. For a better understanding of the invention, its operating advantages and the specific objects attained by its uses, reference should be had to the accompanying drawings and descriptive matter in which there is illustrated preferred embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Additional embodiments of the invention, its nature and its various advantages, will be more apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:

FIG. 1 is a simplified illustration of the process for calculation of relevance measures from a graph of structural connections between resources, and from a set of unstructured texts associated with these resources.

FIG. 2 is an illustration of the resource community topic modeling algorithm, using standard plate notation.

DETAILED DESCRIPTION OF THE INVENTION

The following description includes many specific details. The inclusion of such details is for the purpose of illustration only and should not be understood to limit the invention. Moreover, certain features which are well known in the art are not described in detail in order to avoid complication of the subject matter of the present invention. In addition, it will be understood that features in one embodiment may be combined with features in other embodiments of the invention.

FIG. 1 is an illustration of the process matter of this patent application, which shows inputs 10, 11, and 12, processes 20 and 22, intermediate results 21, and output 30. Input 10 is a set of unstructured text documents associated with resources. Input 11 is a structured graph of resources interconnected with each other through various relations, henceforth called SG. Input 12 is a set of keywords. Process 20 is the resource community topic modeling process, which uses input 10 to produce a set of topic and community probability distributions labeled as 21. Process 22 is a spreading activation process that uses inputs 11 and 12, and intermediate results 21, in order to produce a set of relevance measures for resources. This set of relevance measures is the output of the process, labeled 30.

Resource Community Topic Model

The resource community topic model is defined by the Bayesian network depicted in plate notation in FIG. 2. Here, plate 100 denotes the “document plate,” which represents a single document; plate 101 denotes the “word plate,” representing each word in a document; plate 102 denotes the “topic plate,” representing each latent topic; plate 103 denotes the “community plate,” representing each latent community; and plate 104 denotes the “resource plate,” representing every resource. The resource community topic model is a generative process that performs as follows:

-   The set of resources associated with each document, labeled as 14, generates a single resource r, labeled as 13, from a uniform probability distribution.
-   From resource r, a community c, labeled as 12, is generated based on a mixture proportion Ψ=P(c|r), labeled as 22. The mixture proportion Ψ is modeled as a multinomial distribution with a Dirichlet prior distribution P(Ψ)=Dir(γ). The prior parameter γ is labeled 32.
-   From community c, a topic z, labeled as 11, is generated based on a mixture proportion θ=P(z|c), labeled as 21. The mixture proportion θ is modeled as a multinomial distribution with a Dirichlet prior distribution P(θ)=Dir(α). The prior parameter α is labeled 31.
-   From topic z, a word w, labeled as 10, is generated based on a mixture proportion Φ=P(w|z), labeled as 20. The mixture proportion Φ is modeled as a multinomial distribution with a Dirichlet prior distribution P(Φ)=Dir(β). The prior parameter β is labeled 30.
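The generative steps listed above can be illustrated with a minimal Python sketch. It is not part of the disclosed embodiments: the function names, the list-of-lists layout of Ψ, θ, and Φ, the symmetric prior value 0.5, and the toy sizes are all assumptions made for illustration only.

```python
import random


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off


def sample_dirichlet(concentration, rng):
    """Draw one point from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(doc_resources, n_words, psi, theta, phi, rng):
    """Generate (resource, community, topic, word) tuples for one document."""
    tokens = []
    for _ in range(n_words):
        r = rng.choice(doc_resources)    # uniform over the document's resources
        c = sample_index(psi[r], rng)    # community from Psi = P(c | r)
        z = sample_index(theta[c], rng)  # topic from theta = P(z | c)
        w = sample_index(phi[z], rng)    # word from Phi = P(w | z)
        tokens.append((r, c, z, w))
    return tokens


# Toy sizes and symmetric priors: assumed values, not taken from the disclosure.
rng = random.Random(0)
n_resources, n_communities, n_topics, n_vocab = 3, 2, 4, 10
psi = [sample_dirichlet([0.5] * n_communities, rng) for _ in range(n_resources)]
theta = [sample_dirichlet([0.5] * n_topics, rng) for _ in range(n_communities)]
phi = [sample_dirichlet([0.5] * n_vocab, rng) for _ in range(n_topics)]
tokens = generate_document([0, 2], 20, psi, theta, phi, rng)
```

In this sketch each word carries its own resource, community, and topic assignment, which mirrors the indicators r, c, and z that the inference procedure later has to recover.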

The complete likelihood of generating the corpus, i.e., the joint distribution of all known and hidden variables, given the parameters, is specified by:

$\begin{matrix}{P(D,Z,C,R,\psi,\theta,\varphi \mid \alpha,\beta,\gamma) = \left( \prod\limits_{w_{i} \in W} P(w_{i} \mid \varphi)\, P(z_{j} \mid \theta)\, P(c_{k} \mid \psi)\, P(r_{m}) \cdot P(\theta \mid \alpha) \right) P(\varphi \mid \beta) \cdot P(\psi \mid \gamma)} & (1)\end{matrix}$

where z_(j), c_(k), and r_(m) are indicators that choose a topic, community, and resource for every word w_(i), and Φ, θ, and Ψ are vectors containing all the values for Φ, θ, and Ψ for every w_(i), z_(j), and c_(k). Integrating out Φ, θ, and Ψ, and summing over z_(j), c_(k), and r_(m), we obtain

$\begin{matrix}{P(D \mid \alpha,\beta,\gamma) = \int\int\int \left( \prod\limits_{w_{i} \in W^{d}} \sum\limits_{r_{m} \in R} \sum\limits_{c_{k} \in C} \sum\limits_{z_{j} \in Z} P(w_{i} \mid \varphi)\, P(z_{j} \mid \theta)\, P(c_{k} \mid \psi)\, P(r_{m})\, P(\theta \mid \alpha) \right) P(\varphi \mid \beta)\, P(\psi \mid \gamma)\, d\psi\, d\theta\, d\varphi} & (2)\end{matrix}$

Gibbs Sampling

Exact inference over such a model is generally intractable, as it requires summing over all possible resource, community, and topic assignments. To avoid this, we use Gibbs sampling, a Markov chain Monte Carlo algorithm that provides a good approximate inference for high dimensional models while using relatively simple processes. We construct a Markov chain that converges to the posterior distribution over the latent variables r, c, and z conditioned on D, α, β, γ, and R. Let us denote the assignment of resources, communities, and topics to words other than w_(i) as R_(−i), C_(−i), and Z_(−i) respectively. The Gibbs sampling update equation calculates the probability of assignment to r_(m), c_(k), and z_(j) given w_(i), and given the set of assignments to the other words as:

$\begin{matrix}{P(r = r_{m}, c = c_{k}, z = z_{j} \mid w = w_{i}, R_{-i}, C_{-i}, Z_{-i}, W_{-i}) \propto P(w = w_{i} \mid W_{-i}, z = z_{j}, Z_{-i})\, P(z = z_{j} \mid Z_{-i}, c = c_{k}, C_{-i})\, P(c = c_{k} \mid C_{-i}, r = r_{m}, R_{-i})} & (3)\end{matrix}$

since the distributions W, Z, C, and R are assumed conditionally independent. Note that because the distribution over resources is uniform, P(r) is constant and can be omitted from the proportionality. The three terms on the right hand side of equation (3) are estimates of the random variables Φ, θ, and Ψ, respectively:

$\begin{matrix}{\varphi_{ij} \sim P(w = w_{i} \mid W_{-i}, z = z_{j}, Z_{-i}) \propto \frac{n_{j,-i}^{WZ} + \beta_{ij}}{\sum\limits_{W_{-i}} n_{w}^{WZ} + \sum\limits_{W_{-i}} \beta_{wj}}} & (4) \\ {\theta_{jk} \sim P(z = z_{j} \mid Z_{-i}, c = c_{k}, C_{-i}) \propto \frac{n_{k,-j}^{ZC} + \alpha_{jk}}{\sum\limits_{Z_{-i}} n_{z}^{ZC} + \sum\limits_{Z_{-i}} \alpha_{zk}}} & (5) \\ {\psi_{km} \sim P(c = c_{k} \mid C_{-i}, r = r_{m}, R_{-i}) \propto \frac{n_{m,-k}^{CR} + \gamma_{km}}{\sum\limits_{C_{-i}} n_{c}^{CR} + \sum\limits_{C_{-i}} \gamma_{cm}}} & (6)\end{matrix}$

where n_(j,−i)^(WZ) is the number of times word w_(i) was sampled from topic z_(j), n_(k,−j)^(ZC) is the number of times topic z_(j) was sampled from community c_(k), and n_(m,−k)^(CR) is the number of times community c_(k) was sampled from resource r_(m), all of them excluding the current sample. The summations of counts n_(w)^(WZ), n_(z)^(ZC), and n_(c)^(CR), and of parameters β_(wj), α_(zk), and γ_(cm) in the denominators are over the universe of words, topics, and communities respectively, again excluding the current assignment.
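As a rough illustration of how the count-based estimates of equations (4)-(6) combine into the update of equation (3), the following Python sketch computes a normalized distribution over (resource, community, topic) triples for one word. The function names, the nested-list layout of the count tables (`n_wz[word][topic]`, `n_zc[topic][community]`, `n_cr[community][resource]`), and the toy values are assumptions for illustration; the counts passed in are assumed to already exclude the word's current assignment.

```python
def uniform_prior(n_rows, n_cols, value):
    """Build a prior table in which every entry has the same value."""
    return [[value] * n_cols for _ in range(n_rows)]


def column_ratio(counts, prior, row, col):
    """(count + prior) for one cell, divided by the (count + prior) total of its column."""
    numerator = counts[row][col] + prior[row][col]
    denominator = sum(counts[i][col] + prior[i][col] for i in range(len(counts)))
    return numerator / denominator


def assignment_distribution(word, doc_resources, n_wz, n_zc, n_cr, beta, alpha, gamma):
    """Normalized probability of each (resource, community, topic) triple for one word."""
    weights = {}
    for r in doc_resources:
        for c in range(len(n_cr)):
            p_c = column_ratio(n_cr, gamma, c, r)        # estimate of psi, equation (6)
            for z in range(len(n_zc)):
                p_z = column_ratio(n_zc, alpha, z, c)    # estimate of theta, equation (5)
                p_w = column_ratio(n_wz, beta, word, z)  # estimate of phi, equation (4)
                weights[(r, c, z)] = p_w * p_z * p_c     # product form of equation (3)
    total = sum(weights.values())
    return {triple: weight / total for triple, weight in weights.items()}


# Toy example: 3 words, 2 topics, 2 communities, 2 resources (assumed sizes).
n_wz = [[0, 4], [0, 0], [0, 0]]
n_zc = [[0, 0], [0, 0]]
n_cr = [[0, 0], [0, 0]]
beta = uniform_prior(3, 2, 1.0)
alpha = uniform_prior(2, 2, 1.0)
gamma = uniform_prior(2, 2, 1.0)
dist = assignment_distribution(0, [0, 1], n_wz, n_zc, n_cr, beta, alpha, gamma)
```

Because word 0 has previously been assigned to topic 1 four times in the toy counts, triples with topic 1 receive more probability mass than triples with topic 0. A sampler would draw the new assignment from `dist` and then add the word back into the count tables.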

Moment Matching

Uniform Dirichlet parameters are used for β and γ, as they represent only a prior statement on the sparseness of the φ and ψ distributions, and since it has been demonstrated that there is no significant benefit of learning these parameters when applied to information retrieval. The α parameters must capture the different correlations among topics, and therefore are not assumed uniform. To estimate their values we apply moment matching as follows:

$\begin{matrix}{\mu_{jk} = \frac{1}{n_{k} + 1}\left( \sum\limits_{d \in D} \frac{n_{jk}^{ZC}}{n_{kd}} + \frac{1}{|C|} \right)} & (7) \\ {\sigma_{jk} = \frac{1}{n_{k} + 1}\left( \sum\limits_{d \in D} \left( \frac{n_{jk}^{ZC}}{n_{kd}} - \mu_{jk} \right)^{2} + \left( \frac{1}{|C|} - \mu_{jk} \right)^{2} \right)} & (8) \\ {m_{jk} = \frac{\mu_{jk}\left( 1 - \mu_{jk} \right)}{\sigma_{jk}} - 1} & (9) \\ {\alpha_{jk} = \frac{\mu_{jk}^{|Z|}}{\sum\limits_{Z} \log\left( m_{jk} \right)}} & (10)\end{matrix}$

where n_(jk)^(ZC) is as before, n_(k) is the total number of times that community c_(k) has been sampled, n_(kd) is the total number of times that c_(k) has been sampled for a given document, |C| is the total number of communities, and |Z| is the total number of topics. The moment matching procedure calculates the mean μ_(jk), variance σ_(jk), and moment m_(jk) for a pair of topic z_(j) and community c_(k), and from these values it estimates each hyperparameter α_(jk).

The Gibbs sampling algorithm runs the Markov chain until convergence. After a burn-in period used to eliminate the influence of initialization parameters, a resource, community, and topic assignment is generated for each word in the corpus using the probability distributions estimated up to that point. This collection of generated values, called a Gibbs state, is then used to update the estimators with equations (4)-(6) and the Dirichlet prior α with moment matching.
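The first three moment matching statistics, equations (7)-(9), can be sketched in Python for a single topic and community pair. This is an illustrative sketch only: the function name and argument layout are assumptions, the per-document ratios n_(jk)^(ZC)/n_(kd) are assumed to be precomputed by the caller for documents where n_(kd) is nonzero, and the final step of equation (10) is deliberately left out.

```python
def moment_statistics(ratios, n_k, n_communities):
    """Mean, variance, and moment for one (topic j, community k) pair.

    ratios: per-document values of n_jk^ZC / n_kd (assumed precomputed).
    n_k: the normalizer n_k that appears in equations (7) and (8).
    n_communities: |C|, the total number of communities.
    """
    pseudo = 1.0 / n_communities
    mean = (sum(ratios) + pseudo) / (n_k + 1)                      # equation (7)
    squared = sum((x - mean) ** 2 for x in ratios)
    variance = (squared + (pseudo - mean) ** 2) / (n_k + 1)        # equation (8)
    if variance == 0.0:
        return mean, variance, None  # moment undefined when there is no spread
    moment = mean * (1.0 - mean) / variance - 1.0                  # equation (9)
    return mean, variance, moment


# Toy values (assumed): two documents, two communities.
mean, variance, moment = moment_statistics([0.2, 0.4], 2, 2)
```

The 1/|C| term acts as a single smoothing observation, so the statistics remain defined even when a community has been sampled in very few documents.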

Spreading Activation

Spreading activation is applied over the probability distributions obtained through the resource community topic model using a breadth first search of the SG. An Activation State (AS) is a mapping from nodes in the SG to activation levels, AS: N→R. To form semantic clusters for a community, the algorithm first initializes the AS according to the community's distribution over topics, by utilizing named entity recognition to relate entities in the SG to words in the topic models, augmented with words from natural language statements provided by users as guiding terms for discovery. It then computes the probability for each entity y conditioned on the specified community:

$\begin{matrix}{P(y \mid c = c_{k}) = \sum\limits_{z_{j} \in Z}\left( \theta_{ij}\psi_{jk} \right)} & (11)\end{matrix}$

and sets the activation level of the top-k entities to their corresponding probability. Activation is then spread over the linked data network from the initially activated nodes through multiple iterations, computing activation levels for each node as:

$\begin{matrix}{I_{i} = \sum\limits_{j} a_{j}\frac{g_{t}}{n_{j}}} & (12) \\ {O_{i} = a_{i} + \lambda_{i}\, s\left( I_{i} \right)} & (13) \\ {s(x) = \frac{1 - e^{-ax}}{1 + e^{-ax}}} & (14) \\ {a_{i + 1} = \left\{ \begin{matrix}{O_{i},{O_{i} \geq h}} \\ {0,{O_{i} < h}}\end{matrix} \right.} & (15)\end{matrix}$

where I_(i) is the input, a_(i) is the current activation, and O_(i) is the output activation of node i, g_(t) is a gain factor based on the relationship type, n_(j) is the number of outgoing connections of type t from node j, λ_(i) is an efficiency factor, and h is a threshold. The sigmoid function s(x) ensures a maximum input activation of one, and also attenuates small activation levels, to avoid runaway activation. The set of a_(i+1) constitutes the current AS. Since we are applying spreading activation for search within a highly connected graph with loops, we terminate the algorithm after a set number of iterations. The final AS contains the nodes for the semantic cluster, where the activation level indicates the relevance of the node to the community. Clusters are formed for topics and documents in an analogous manner.
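The iteration described by equations (12)-(15) can be sketched in Python as follows. The sketch rests on several assumptions that are not in the disclosure: the graph is given as a list of (source, target, relation type) edges, a single efficiency factor is shared by all nodes instead of a per-node λ, the sigmoid's constant is exposed as a `slope` parameter, and all names and toy values are hypothetical.

```python
import math


def sigmoid(x, slope):
    """Squashing function of equation (14); bounded above by one."""
    return (1.0 - math.exp(-slope * x)) / (1.0 + math.exp(-slope * x))


def spread_activation(edges, activation, gains, efficiency, threshold, slope, iterations):
    """Run a fixed number of spreading iterations and return the final activation state."""
    nodes = set(activation)
    for source, target, _ in edges:
        nodes.update((source, target))
    state = {node: activation.get(node, 0.0) for node in nodes}

    # Number of outgoing connections of each relation type per node.
    fan_out = {}
    for source, _, relation in edges:
        fan_out[(source, relation)] = fan_out.get((source, relation), 0) + 1

    for _ in range(iterations):
        inputs = {node: 0.0 for node in nodes}
        for source, target, relation in edges:
            # Equation (12): sender activation scaled by gain and fan-out.
            inputs[target] += state[source] * gains[relation] / fan_out[(source, relation)]
        next_state = {}
        for node in nodes:
            output = state[node] + efficiency * sigmoid(inputs[node], slope)  # equation (13)
            next_state[node] = output if output >= threshold else 0.0        # equation (15)
        state = next_state
    return state


# Toy graph (assumed): node "a" starts fully activated and links to "b" and "c".
edges = [("a", "b", "cites"), ("a", "c", "cites")]
final = spread_activation(edges, {"a": 1.0}, {"cites": 1.0}, 0.5, 0.1, 1.0, 1)
```

Stopping after a fixed iteration count, rather than waiting for the state to settle, follows the text's rationale: in a highly connected graph with loops, activation can keep circulating indefinitely.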

Considerations on Presentation of the Proposed Process

It is understood herein that the detailed description may be presented in terms of program procedures executed on a computer or network of computers. These procedural descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art.

A procedure is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. These steps are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Further, the manipulations performed are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary in any of the operations described herein which form part of the present invention; the operations are machine operations. Useful machines for performing the operation of the present invention include general purpose digital computers or similar devices.

The present invention also relates to apparatus for performing these operations. This apparatus may be specially constructed for the required purpose or it may comprise a general purpose computer as selectively activated or reconfigured by a computer program stored in the computer. The procedures presented herein are not inherently related to a particular computer or other apparatus. Various general purpose machines may be used with programs written in accordance with the teachings herein, or it may prove more convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these machines will appear from the description given.

The system according to the invention may include a general purpose computer, or a specially programmed special purpose computer. The user may interact with the system via e.g., a personal computer or over a smartphone, the Internet, an Intranet, etc. Either of these may be implemented as a distributed computer system rather than a single computer. Moreover, the processing could be controlled by a software program on one or more computer systems or processors, or could even be partially or wholly implemented in hardware.

Portions of the system may be provided in any appropriate electronic format, including, for example, provided over a communication line as electronic signals, provided on CD and/or DVD, provided on optical disk memory, etc.

Any presently available or future developed computer software language and/or hardware components can be employed in such embodiments of the present invention. For example, at least some of the functionality mentioned above could be implemented using Visual Basic, C++, or any assembly language appropriate in view of the processor being used. It could also be written in an object-oriented and/or interpretive environment such as Java and transported to multiple destinations to various users.

It is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the present invention. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the present invention.

Although the present invention has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention may be made without departing from the spirit and scope of the invention, which is limited only by the claims which follow.

What is claimed is:
 1. A computer implemented method for estimating latent topics and latent communities between resources, the method comprising: receiving a set of unstructured textual data associated with resources; calculating latent topics as probability distributions over words within the unstructured textual data, latent communities as probability distributions over topics, and resources as probability distributions over communities.
 2. The method of claim 1, wherein the resources to be processed are specifically references to people.
 3. The method of claim 1, wherein the resources include any subset of entities represented through some textual description within a computational environment.
 4. A computer implemented method for assigning to computational resources relevance measures with respect to a user, the method comprising: utilizing the computer implemented method of claim 1 to calculate probability distributions of resources over communities; calculating relevance measures as the similarity between the probability distributions of resources and the probability distributions of users.
 5. The method of claim 4, wherein the resources to be processed are specifically references to people.
 6. The method of claim 4, wherein the resources include any subset of entities represented through some textual description within a computational environment.
 7. A computer implemented method for assigning to computational resources relevance measures with respect to a user, the method comprising: utilizing the computer implemented method of claim 1 to calculate probability distributions of resources over communities; receiving a structured graph of resources linked by relations between them; assigning initial activation states to the nodes in the graph of resources through named entity recognition to relate entities to words in the topic distributions; spreading activation through the graph in a set number of iterations; obtaining a relevance measure as the final activation of the different nodes in the graph.
 8. The method of claim 7, wherein the resources to be processed are specifically references to people.
 9. The method of claim 7, wherein the resources include any subset of entities represented through some textual description within a computational environment.
 10. The method of claim 7, wherein a set of keywords is provided as additional input, and wherein these keywords generate additional initial activation by relating them to words in the topic distributions.
 11. The method of claim 10, wherein the resources to be processed are specifically references to people.
 12. The method of claim 10, wherein the resources include any subset of entities represented through some textual description within a computational environment.
 13. A data processing system for assigning to computational resources relevance measures with respect to a user, the system comprising: a display device; and a processor configured to: receive unstructured textual data related to computational resources; calculate latent topics as probability distributions over words within the unstructured textual data, latent communities as probability distributions over topics, and resources as probability distributions over communities; receive structured data representing a graph of resources linked by relations between them; receive a set of keywords; assign initial activation states to the nodes in the graph of resources through named entity recognition to relate entities to words in the topic distributions; spread activation through the graph in a set number of iterations; obtain a relevance measure as the final activation of the different nodes in the graph. 